Abstract: A novel coronavirus has been identified as the cause of the
outbreak of severe acute respiratory syndrome (SARS). Previous
phylogenetic analyses based on sequence alignments show that SARS-CoVs
form a new group distantly related to the other three groups of
previously characterized coronaviruses. In this article, a new approach
based on the 2D graphical representation of the whole genome sequence is
proposed to analyze the phylogenetic relationships of coronaviruses. The
evolutionary distances are obtained by measuring the differences among
the two-dimensional curves.
Introduction


The outbreak of atypical pneumonia, referred to as severe acute
respiratory syndrome (SARS), was first identified in Guangdong
Province, China, and later spread to several countries. A novel
coronavirus was isolated and found to be the cause of SARS. The
SARS-coronavirus is a new member of the order Nidovirales, family
Coronaviridae, and genus Coronavirus. Some researchers have carried out
mutation analyses and phylogenetic analyses of the virus.1-6

Phylogenetic analysis using biological sequences can be divided
into two groups. The algorithms in the first group calculate a matrix
representing the distance between each pair of sequences and then
transform this matrix into a tree. In the second type of approach,
instead of building a tree directly, the tree that best explains the
observed sequences under the evolutionary assumption is found by
evaluating the fitness of different topologies. For example, Jukes and
Cantor,7 Kimura,8 Barry and Hartigan,9 Kishino and Hasegawa,10
and Lake11 proposed various distance measures. Camin and Sokal,12
Eck and Dayhoff,13 Cavalli-Sforza and Edwards,14 and Fitch15
gave parsimony methods. Felsenstein et al.16-18 proposed maximum
likelihood methods.
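
As an illustration of the first, distance-matrix group of methods (not of the approach proposed in this article), the following minimal Python sketch computes pairwise Jukes-Cantor distances, d = -(3/4) ln(1 - 4p/3), where p is the proportion of differing sites. The function names are hypothetical, and the sequences are assumed to be already aligned to equal length.

```python
import math

def jukes_cantor_distance(seq_a, seq_b):
    """Jukes-Cantor distance between two aligned, equal-length DNA sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Assumption: compare only columns where both sequences have A, C, G, or T.
    pairs = [(a, b) for a, b in zip(seq_a.upper(), seq_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(a != b for a, b in pairs) / len(pairs)  # proportion of differing sites
    if p >= 0.75:
        return math.inf  # the correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

def distance_matrix(aligned):
    """Pairwise distance matrix for a dict mapping names to aligned sequences."""
    names = list(aligned)
    matrix = [[jukes_cantor_distance(aligned[r], aligned[c]) for c in names]
              for r in names]
    return names, matrix
```

A matrix of this kind is what a tree-building step then takes as input. Note that the function needs an alignment before it can be applied at all, which is the requirement discussed in the next paragraph.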

However, all of these methods require a multiple alignment of the
sequences and assume some sort of evolutionary model. In addition
to the problems of multiple alignment (computational complexity
and inherent ambiguity of the alignment cost criteria), these methods
become insufficient for phylogenies using complete genomes.
Multiple alignment becomes misleading because of gene rearrangement,
inversion, transposition, and translocation at the substring level,
unequal lengths of sequences, etc., and statistical evolutionary models
are yet to be suggested for complete genomes. On the other hand,
whole genome-based phylogenetic analyses are appearing because
single gene sequences generally do not possess enough information
to construct an evolutionary history of organisms. Factors
such as different rates of evolution and horizontal gene transfer
make phylogenetic analysis of species using single gene sequences
difficult.

Mathematical analysis of the large volume of genomic DNA
sequence data is one of the challenges for bioscientists. Graphical
representation of DNA sequences provides a simple way of viewing,
sorting, and comparing various gene structures. In recent years,
several authors have outlined different 2D, 3D, or 4D graphical
representations of DNA sequences.19-24 Graphical techniques have
emerged as a very powerful tool for the visualization and analysis
of long DNA sequences. These techniques provide useful insights
into local and global characteristics and the occurrences, variations,
and repetition of the nucleotides along a sequence that are not as
easily obtainable by other methods.?*? Based on these graphical 
representations, several authors have outlined approaches for
comparing DNA sequences.24-28 Recently, we presented a new
two-dimensional graphical representation of DNA sequences, which
has no circuit or degeneracy.19

Here, a new approach based on the 2D graphical representation
of the whole genome sequence is proposed to analyze the
phylogenetic relationships of genomes. The evolutionary distances are
obtained by measuring the differences among the 2D curves.
The examination of the phylogenetic relationships of coronaviruses
illustrates the utility of our approach.


2D Graphical Representation of DNA Sequences 


As shown in Figure 1, and similarly to Yan's method,24 we construct
a pyrimidine-purine graph on two quadrants of the Cartesian
coordinate system, with the pyrimidines (T and C) in the first
quadrant and the purines (A and G) in the fourth quadrant. The unit
vectors representing the four nucleotides A, G, C, and T are as follows:

$$(m, -\sqrt{n}) \to A, \quad (\sqrt{n}, -m) \to G, \quad (\sqrt{n}, m) \to C, \quad (m, \sqrt{n}) \to T$$

where m is a real number and n is a positive real number that is not a
perfect square. Using this representation, we reduce a DNA sequence
to a series of nodes $P_0, P_1, P_2, \ldots, P_N$, whose
coordinates $x_i, y_i$ ($i = 0, 1, 2, \ldots, N$, where N is the length
of the DNA sequence being studied) satisfy

$$x_i = a_i m + g_i \sqrt{n} + c_i \sqrt{n} + t_i m$$
$$y_i = -a_i \sqrt{n} - g_i m + c_i m + t_i \sqrt{n}$$

where $a_i$, $c_i$, $g_i$, and $t_i$ are the cumulative occurrence
numbers of A, C, G, and T, respectively, in the subsequence from the
first base to the i-th base in the sequence. We define
$a_0 = c_0 = g_0 = t_0 = 0$.
We call the corresponding set of points the characteristic plot set.
The curve connecting all points of the characteristic plot set in turn
is called the characteristic curve, which is determined by the values of
m and n that satisfy the above-mentioned condition. In Figure 2, we show
the curves corresponding to the chimpanzee for different parameters n
and m. Observing Figure 2, we find that the chimpanzee curves are
similar despite the different values of n and m; they show the same
tendency despite their different lengths. In Figure 3, we present
the 2D curves for the 24 complete coronavirus genomes (see Table 1)
with parameters n = 1/2 and m = 3/4, chosen initially by Yan
et al.24

Figure 1. Pyrimidine-purine graph.

Figure 2. The chimpanzee curves for different parameters n and m
(panels: m = 1/4, n = 3/4; m = 1/2, n = 3/4).
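
As a minimal sketch of the construction defined in this section, the following Python function accumulates the four nucleotide vectors to obtain the nodes P_0, ..., P_N of the characteristic curve. The function name, the default parameter values (m = 1/2, n = 3/4, taken from one panel of Figure 2), and the choice to skip symbols other than A, C, G, and T are assumptions made for illustration only.

```python
import math

def characteristic_curve(sequence, m=0.5, n=0.75):
    """Return the nodes P_0..P_N of the 2D curve for a DNA sequence."""
    root_n = math.sqrt(n)
    steps = {
        "A": (m, -root_n),  # purines point into the fourth quadrant
        "G": (root_n, -m),
        "C": (root_n, m),   # pyrimidines point into the first quadrant
        "T": (m, root_n),
    }
    x, y = 0.0, 0.0
    points = [(x, y)]       # P_0 is the origin (a_0 = c_0 = g_0 = t_0 = 0)
    for base in sequence.upper():
        if base not in steps:
            continue        # assumption: ignore ambiguous symbols such as N
        dx, dy = steps[base]
        x += dx
        y += dy
        points.append((x, y))
    return points
```

For example, `characteristic_curve("ATGC")` returns five nodes; adding the step vectors one base at a time gives the same coordinates as the cumulative-count formulas for x_i and y_i, up to floating-point rounding.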

Observing Figure 3, we find that the curves of BCoV, BCoVL, 
BCoVM, and BCoVQ have some similar tendencies. The curves of 
MHV2, MHV, MHVM, and MHVP have some similar tendencies. 
The curves of BJ01, CUHK-Su10, CUHK-W1, SIN2679, SIN2748,
SIN2774, HKU-39849, SIN2500, SIN2677, TW1, Urbani, and
TOR2 have some similar tendencies.


Phylogenetic Tree of Coronaviruses 
For any sequence, we have a set of points $(x_i, y_i)$,
$i = 1, 2, 3, \ldots, N$, where N is the length of the sequence. The
coordinates of the geometrical center of the points, denoted by $x^0$
and $y^0$, may be calculated as follows:29

$$x^0 = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad y^0 = \frac{1}{N} \sum_{i=1}^{N} y_i. \tag{1}$$


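The elements of the covariance matrix CM of the points are
defined as:

$$\begin{aligned}
CM_{xx} &= \frac{1}{N} \sum_{i=1}^{N} (x_i - x^0)(x_i - x^0) \\
CM_{xy} &= \frac{1}{N} \sum_{i=1}^{N} (x_i - x^0)(y_i - y^0) = CM_{yx} \\
CM_{yy} &= \frac{1}{N} \sum_{i=1}^{N} (y_i - y^0)(y_i - y^0)
\end{aligned} \tag{2}$$

The above four numbers give a quantitative description of a set of
points $(x_i, y_i)$, $i = 1, 2, \ldots, N$, scattered in a 2D space.
Obviously,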
Figure 3. The two-dimensional curves for 24 complete coronavirus
genomes. (A) IBV, BCoV, BCoVL, BCoVM, BCoVQ, HCoV-229E. (B) MHV2,
MHV, MHVM, MHVP, PEDV, TGEV. (C) BJ01, CUHK-Su10, CUHK-W1, SIN2679,
SIN2748, SIN2774. (D) HKU-39849, SIN2500, SIN2677, TW1, Urbani, TOR2.
[Color figure can be viewed in the online issue, which is available at
www.interscience.wiley.com.]
Table 1. The Accession Number, Abbreviation, Name, and Length for the 24 Coronavirus Genomes.


No.  Accession   Abbreviation  Genome                                    Length (nt)
1    NC_002645   HCoV_229E     Human coronavirus 229E                    27,317
2    NC_002306   TGEV          Transmissible gastroenteritis virus       28,586
3    NC_003436   PEDV          Porcine epidemic diarrhea virus           28,033
4    U00735      BCoVM         Bovine coronavirus strain Mebus           31,032
5    AF391542    BCoVL         Bovine coronavirus isolate BCoV-LUN       31,028
6    AF220295    BCoVQ         Bovine coronavirus Quebec                 31,100
7    NC_003045   BCoV          Bovine coronavirus                        31,028
8    AF208067    MHVM          Murine hepatitis virus strain ML-10       34,233
9    AF101929    MHV2          Murine hepatitis virus strain 2           31,276
10   AF208066    MHVP          Murine hepatitis virus strain Penn 97-1   31,112
11   NC_001846   MHV           Murine hepatitis virus                    31,357
12   NC_001451   IBV           Avian infectious bronchitis virus         27,608
13   AY278488    BJ01          SARS coronavirus BJ01                     29,725
14   AY278741    Urbani        SARS coronavirus Urbani                   29,727
15   AY278491    HKU-39849     SARS coronavirus HKU-39849                29,742
16   AY278554    CUHK-W1       SARS coronavirus CUHK-W1                  29,736
17   AY282752    CUHK-Su10     SARS coronavirus CUHK-Su10                29,736
18   AY283794    SIN2500       SARS coronavirus Sin2500                  29,711
19   AY283795    SIN2677       SARS coronavirus Sin2677                  29,705
20   AY283796    SIN2679       SARS coronavirus Sin2679                  29,711
21   AY283797    SIN2748       SARS coronavirus Sin2748                  29,706
22   AY283798    SIN2774       SARS coronavirus Sin2774                  29,711
23   AY291451    TW1           SARS coronavirus TW1                      29,729
24   NC_004718   TOR2          SARS coronavirus                          29,751


Table 2. The Geometric Center and Two Eigenvectors for each of the 24 Coronavirus Genomes.


i    x^0           y^0        EV_λ1               EV_λ2
1    8.7251e+003   567.4895   (0.0671, -0.9977)   (-0.9977, -0.0671)
2    9.1181e+003   231.8617   (0.0265, -0.9996)   (-0.9996, -0.0265)
3    9.1658e+003   854.0672   (0.0891, -0.9960)   (-0.9960, -0.0891)
4    9.8471e+003   678.7491   (0.0682, -0.9977)   (-0.9977, -0.0682)
5    9.8494e+003   669.8507   (0.0683, -0.9977)   (-0.9977, -0.0683)
6    9.8708e+003   671.8188   (0.0678, -0.9977)   (-0.9977, -0.0678)
7    9.8504e+003   667.9839   (0.0684, -0.9977)   (-0.9977, -0.0684)
8    1.0225e+004   508.6553   (0.0456, -0.9990)   (-0.9990, -0.0456)
9    1.0217e+004   560.8241   (0.0484, -0.9988)   (-0.9988, -0.0484)
10   1.0166e+004   571.4215   (0.0492, -0.9988)   (-0.9988, -0.0492)
11   1.0266e+004   503.3193   (0.0457, -0.9990)   (-0.9990, -0.0457)
12   8.8359e+003   177.6139   (0.0271, -0.9996)   (-0.9996, -0.0271)
13   9.6653e+003   217.7081   (0.0348, -0.9994)   (-0.9994, -0.0348)
14   9.6644e+003   220.2759   (0.0347, -0.9994)   (-0.9994, -0.0347)
15   9.6693e+003   219.4720   (0.0345, -0.9994)   (-0.9994, -0.0345)
16   9.6690e+003   217.1652   (0.0346, -0.9994)   (-0.9994, -0.0346)
17   9.6687e+003   217.0494   (0.0346, -0.9994)   (-0.9994, -0.0346)
18   9.6602e+003   216.5541   (0.0347, -0.9994)   (-0.9994, -0.0347)
19   9.6587e+003   216.9280   (0.0347, -0.9994)   (-0.9994, -0.0347)
20   9.6601e+003   216.0181   (0.0346, -0.9994)   (-0.9994, -0.0346)
21   9.6583e+003   216.5654   (0.0347, -0.9994)   (-0.9994, -0.0347)
22   9.6601e+003   216.0584   (0.0346, -0.9994)   (-0.9994, -0.0346)
23   9.6656e+003   220.1538   (0.0347, -0.9994)   (-0.9994, -0.0347)
24   9.6724e+003   219.6501   (0.0346, -0.9994)   (-0.9994, -0.0346)
the matrix is a real symmetric 2 x 2 one. The eigenvectors and their
associated eigenvalues are defined as follows:

$$CM \cdot EV_k = \lambda_k EV_k, \qquad EV_k = (EV_{k,1}, EV_{k,2})^T, \quad k = 1, 2.$$

Corresponding to each eigenvalue $\lambda_k$, there is an eigenvector
$EV_k$. With $\lambda_1 < \lambda_2$, the two eigenvectors are denoted
by $EV_{\lambda_1}$ and $EV_{\lambda_2}$, respectively. In Table 2, we
list $(x^0, y^0)$ and the eigenvectors belonging to the 24 species with
parameters m = 1/2, n = 3/4.
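
The quantities in Eqs. (1) and (2) and the eigen-decomposition can be sketched in plain Python as below. This is an illustrative sketch, not the authors' code: the function names are hypothetical, the caller is assumed to pass the nodes i = 1, ..., N of a characteristic curve, and the closed-form solution for a symmetric 2 x 2 matrix is used in place of a linear algebra library. Eigenvectors are only defined up to sign, so the signs returned here may be the opposite of those listed in Table 2; any angle comparison needs one consistent sign convention.

```python
import math

def geometric_center(points):
    """Geometric center (x0, y0) of a list of (x, y) points, Eq. (1)."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)

def covariance_matrix(points):
    """Return (CM_xx, CM_xy, CM_yy) as defined in Eq. (2)."""
    n = len(points)
    x0, y0 = geometric_center(points)
    cxx = sum((x - x0) ** 2 for x, _ in points) / n
    cxy = sum((x - x0) * (y - y0) for x, y in points) / n
    cyy = sum((y - y0) ** 2 for _, y in points) / n
    return cxx, cxy, cyy

def eigen_2x2_symmetric(cxx, cxy, cyy):
    """Eigenpairs of [[cxx, cxy], [cxy, cyy]], ordered so lambda_1 <= lambda_2."""
    mean = (cxx + cyy) / 2.0
    radius = math.hypot((cxx - cyy) / 2.0, cxy)
    lam1, lam2 = mean - radius, mean + radius
    # phi is the direction of the major axis (eigenvector of lambda_2).
    phi = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    ev2 = (math.cos(phi), math.sin(phi))
    ev1 = (-math.sin(phi), math.cos(phi))  # perpendicular direction, for lambda_1
    return (lam1, ev1), (lam2, ev2)
```

For a point set lying on the line y = x, for instance, the smaller eigenvalue is zero and the eigenvector of the larger eigenvalue points along (1, 1)/sqrt(2), which is the trend direction of the points.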

To facilitate the quantitative comparison of different species
in terms of their collective parameters, we introduce a distance
scale and an angle scale as defined below. Suppose that there are
two species i and j whose parameters are
$x_i^0, y_i^0, \lambda_1^i, \lambda_2^i$ and
$x_j^0, y_j^0, \lambda_1^j, \lambda_2^j$, respectively, where
$(x_i^0, y_i^0)$ is the geometrical center of the curve belonging to
species i, and $\lambda_1^i, \lambda_2^i$ are the two eigenvalues of
the matrix $CM_i$ corresponding to species i. The distance $d_{ij}$
between the two points is:29

$$d_{ij} = \sqrt{(x_i^0 - x_j^0)^2 + (y_i^0 - y_j^0)^2}, \qquad i, j = 1, 2, \ldots, M \tag{3}$$

where $d_{ij}$ denotes the distance between the geometric centers of the
i-th and the j-th genomes, and M is the total number of genomes
(M = 24 here). We thus obtain a real symmetric M x M matrix
whose elements are $d_{ij}$.
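
As a small worked example of Eq. (3), the sketch below builds the matrix of center-to-center distances for four of the 24 genomes. The coordinates are copied from Table 2 under the assumption that row i of Table 2 corresponds to genome No. i of Table 1; the function and variable names are hypothetical.

```python
import math

# Geometric centers (x0, y0) copied from Table 2 for four genomes
# (assumption: Table 2 row i is the genome numbered i in Table 1).
centers = {
    "HCoV_229E": (8725.1, 567.4895),
    "TGEV": (9118.1, 231.8617),
    "BJ01": (9665.3, 217.7081),
    "Urbani": (9664.4, 220.2759),
}

def center_distance_matrix(centers):
    """Euclidean distances d_ij between geometric centers, Eq. (3)."""
    names = list(centers)
    matrix = [
        [math.hypot(centers[a][0] - centers[b][0], centers[a][1] - centers[b][1])
         for b in names]
        for a in names
    ]
    return names, matrix

names, d = center_distance_matrix(centers)
```

With these values, the two SARS isolates BJ01 and Urbani lie only a few units apart, whereas HCoV_229E and TGEV are separated by several hundred units.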

To reflect the differences between the trends of every two 2D
curves, the angles between the corresponding eigenvectors of every
two genomes are used. The 2D vectors are denoted as follows:

$$EV_k^i = (EV_{k,1}^i, EV_{k,2}^i)^T, \qquad i = 1, 2, \ldots, M, \quad k = \lambda_1, \lambda_2. \tag{4}$$

The angle between the two vectors is denoted as follows:

$$\theta_{ij}^k = \arccos\left(\frac{EV_k^i \cdot EV_k^j}{|EV_k^i| \cdot |EV_k^j|}\right), \qquad i, j = 1, 2, \ldots, M, \quad k = \lambda_1, \lambda_2. \tag{5}$$

The sum of $\theta_{ij}^k$ over k for given i, j can be used to reflect
the trend information of the eigenvectors involved:

$$\theta_{ij} = \theta_{ij}^{\lambda_1} + \theta_{ij}^{\lambda_2}, \qquad i, j = 1, 2, \ldots, M. \tag{6}$$
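
Equations (5) and (6) can be sketched as follows, using the eigenvectors listed in Table 2 for genomes 1 and 2 (assumed here to be HCoV_229E and TGEV, following the numbering of Table 1). The function names are hypothetical, and the clamping of the cosine is a numerical safeguard added for this sketch rather than part of the method.

```python
import math

def angle_between(u, v):
    """Angle in radians between two 2D vectors, Eq. (5)."""
    dot = u[0] * v[0] + u[1] * v[1]
    norm = math.hypot(u[0], u[1]) * math.hypot(v[0], v[1])
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    return math.acos(max(-1.0, min(1.0, dot / norm)))

def trend_angle(evs_i, evs_j):
    """theta_ij: sum of the angles between corresponding eigenvectors, Eq. (6)."""
    return sum(angle_between(a, b) for a, b in zip(evs_i, evs_j))

# (EV_lambda1, EV_lambda2) pairs copied from Table 2, rows 1 and 2.
hcov_229e = ((0.0671, -0.9977), (-0.9977, -0.0671))
tgev = ((0.0265, -0.9996), (-0.9996, -0.0265))
theta = trend_angle(hcov_229e, tgev)
```

For this pair the result is roughly 0.08 rad, that is, about 0.04 rad for each of the two eigenvector pairs.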


Consequently, two sets of parameters are obtained. The first
reflects the difference of center positions, represented by the
Euclidean distance between the geometric centers. The second
indicates the difference of the trends of the 2D curves, represented by
the related eigenvectors. The overall distance $D_{ij}$ between the
species i and j is defined by

$$D_{ij} = d_{ij} \times \theta_{ij}, \qquad i, j = 1, 2, \ldots, M. \tag{7}$$


Accordingly, a real symmetric M x M matrix $D_{ij}$ is obtained
and used to reflect the evolutionary distance between the species
i and j. The clustering tree is constructed using the UPGMA method
in the PHYLIP package
(http://evolution.genetics.washington.edu/phylip.html). The final
phylogenetic tree is drawn using the DRAWGRAM program in the PHYLIP
package. In Figure 4, we present the phylogenetic tree of the 24
species.
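
To make the last two steps concrete, the sketch below forms the overall distance matrix of Eq. (7) and clusters it by UPGMA (average linkage). It is a small pure-Python stand-in written for illustration, not the PHYLIP implementation used in this work: the function names are hypothetical, ties are broken by iteration order, and only the tree topology is returned as a Newick-style string without branch lengths.

```python
def overall_distance(d, theta):
    """Element-wise product D_ij = d_ij * theta_ij, Eq. (7)."""
    return [[dij * tij for dij, tij in zip(d_row, t_row)]
            for d_row, t_row in zip(d, theta)]

def upgma(names, dist):
    """Cluster a symmetric distance matrix by UPGMA; return a Newick-style topology."""
    count = len(names)
    clusters = {i: (names[i], 1) for i in range(count)}  # id -> (label, size)
    pair_dist = {(i, j): dist[i][j]
                 for i in range(count) for j in range(i + 1, count)}
    next_id = count
    while len(clusters) > 1:
        # Merge the closest pair of clusters.
        (a, b), _ = min(pair_dist.items(), key=lambda item: item[1])
        label_a, size_a = clusters.pop(a)
        label_b, size_b = clusters.pop(b)
        merged = {}
        for c in clusters:
            dac = pair_dist[(min(a, c), max(a, c))]
            dbc = pair_dist[(min(b, c), max(b, c))]
            # Size-weighted average of the distances to the two merged clusters.
            merged[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        pair_dist = {k: v for k, v in pair_dist.items()
                     if a not in k and b not in k}
        pair_dist.update(merged)
        clusters[next_id] = ("(%s,%s)" % (label_a, label_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0] + ";"
```

For three taxa with distances A-B = 1, A-C = 4, and B-C = 4, `upgma(["A", "B", "C"], ...)` first joins A and B and then attaches C, giving `(C,(A,B));`.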
Figure 4. Phylogenetic tree.
Conclusion


Most existing approaches for phylogenetic inference use multiple
alignment of sequences and assume some sort of evolutionary
model. The multiple alignment strategy does not work for all types
of data (for example, whole-genome phylogeny), and the evolutionary
models may not always be correct. Our representation provides
a direct plotting method to denote DNA sequences without degeneracy.
From the DNA graph, the A, T, G, and C usage as well as the
original DNA sequence can be recaptured mathematically without
loss of textual information. The current 2D graphical representation
of DNA sequences offers a different approach to constructing
the phylogenetic tree. Unlike most existing phylogeny construction
methods, the proposed method does not require multiple alignment.
Also, both computational scientists and molecular biologists can use
it to analyze DNA sequences efficiently with different parameters
n and m.